Smart Bin Optimized Waste Collection Route¶

Authored by: Sachitha Sadeepa Kasthuriarachchi

Duration: 180 mins

Level: Beginner/Intermediate

Prerequisite Skills: Python, Jupyter Notebooks, Power BI, and Data Analysis

Scenario:

In Melbourne, managing urban waste efficiently is paramount for promoting cleanliness, minimizing environmental degradation, and safeguarding public health. The introduction of smart bin technology, which facilitates real-time monitoring of waste levels, presents an innovative opportunity to overhaul the city's waste collection methodologies. This initiative focuses on refining waste collection routes by utilizing sophisticated data analytics, including real-time data on bin fill levels and precise location coordinates of smart bins. Although some of this data can be accessed through the client's open data portal, incorporating additional datasets, such as real-time traffic conditions, may necessitate partnerships with local government bodies or external data providers. The goal of this project is to streamline waste management processes in Melbourne by optimizing collection routes, thereby curtailing fuel usage, alleviating traffic congestion, and boosting overall operational efficiency. This strategic approach not only enhances service reliability but also contributes significantly to the city's environmental and economic sustainability.

User Story:

As the waste management authority in Melbourne, I aim to streamline waste collection routes for smart bins to enhance waste collection efficiency. My goal is to minimize fuel consumption, reduce traffic congestion, and cut operational costs.

Acceptance Criteria:

  1. Efficient Route Planning: The solution should provide optimized routes for waste collection vehicles to pick up waste from smart bins across different areas of Melbourne.

  2. Real-time Data Integration: The system should integrate real-time data from smart bins to dynamically adjust collection routes based on current fill levels and demand patterns.

  3. Minimize Environmental Impact: The optimized routes should aim to reduce fuel consumption and carbon emissions by minimizing travel distance and time.

  4. Cost-effectiveness: The optimized routes should minimize operational costs associated with waste collection, including labor, fuel, and vehicle maintenance.

  5. Scalability: The solution should be scalable to accommodate the growing number of smart bins and evolving waste collection needs in Melbourne.

  6. User-Friendly Interface: The interface should be intuitive and user-friendly, allowing waste management authorities to easily visualize and manage waste collection routes.

What this use case will teach you

By the end of this use case, you will have gained:

  • Route Optimization: Ability to develop algorithms and methodologies for optimizing waste collection routes, considering factors such as distance, time, and resource utilization.

  • Data Integration: Proficiency in integrating real-time data from smart bins into route planning algorithms, ensuring accurate and timely decision-making.

  • Environmental Impact Analysis: Capability to assess and mitigate environmental impacts of waste collection operations, including reducing fuel consumption and carbon emissions.

  • Cost Optimization: Skills in cost analysis and resource allocation to minimize operational expenses associated with waste collection, including labor, fuel, and vehicle maintenance.

  • Scalability Planning: Ability to design scalable solutions to accommodate the growing number of smart bins and evolving waste management needs in urban environments.

  • User Experience Design: Proficiency in designing user-friendly interfaces for waste management authorities, ensuring ease of use and efficient management of collection routes.
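To give a concrete flavour of the "Route Optimization" outcome, the sketch below pairs a haversine distance function with a greedy nearest-neighbour heuristic. The depot location, bin names, and coordinates are illustrative placeholders rather than values from the dataset, and a real deployment would use a proper vehicle-routing solver instead of this greedy pass.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def nearest_neighbour_route(depot, bins):
    """Greedy route: from the depot, always visit the closest unvisited bin."""
    route, current, remaining = [], depot, dict(bins)
    while remaining:
        # Pick the unvisited bin nearest to the current position
        name = min(remaining, key=lambda n: haversine_km(current, remaining[n]))
        route.append(name)
        current = remaining.pop(name)
    return route

# Hypothetical depot and bin coordinates around Argyle Square
depot = (-37.8036, 144.9655)
bins = {
    "Sensor-05": (-37.802983, 144.966267),
    "Sensor-15": (-37.802017, 144.965500),
    "Sensor-17": (-37.803217, 144.966100),
}
print(nearest_neighbour_route(depot, bins))
```

Nearest-neighbour is a common warm-start for route planners: it is fast and easy to explain, but its tours can be noticeably longer than optimal, which is why production systems refine them further.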

1. Importing Data¶

Dataset link Smart bins Argyle Square

1.1 Dataset Imported through API¶

In [1]:
import requests
import pandas as pd
import os

def fetch_data(base_url, dataset, api_key, num_records=99, offset=0):
    all_records = []
    max_offset = 9900  # Cap on the pagination offset (limits total records fetched)

    while True:
        # maximum limit check
        if offset > max_offset:
            break

        # Create API request URL
        filters = f'{dataset}/records?limit={num_records}&offset={offset}'
        url = f'{base_url}{filters}&api_key={api_key}'

        # Start request
        try:
            result = requests.get(url, timeout=10)
            result.raise_for_status()
            records = result.json().get('results')
        except requests.exceptions.RequestException as e:
            raise RuntimeError(f"API request failed: {e}") from e
        if records is None:
            break
        all_records.extend(records)
        if len(records) < num_records:
            break

        # next cycle offset
        offset += num_records

    # DataFrame all data
    df = pd.DataFrame(all_records)
    return df    

API_KEY = os.environ.get("API_KEY")  # note: the original lookup had a stray trailing space in the name
if API_KEY is None:
    print("Warning: the API_KEY environment variable is not set")
BASE_URL = "https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/"
Warning: the API_KEY environment variable is not set

1.2 Fetching Smart Bin Sensor Data¶

This code initiates the process of gathering data for the project by accessing a specific dataset named smart-bins-argyle-square, which consists of sensor data from smart bins. The main steps involved are:

  1. Define Dataset Name: Assign the name of the dataset to the variable Bin_Sensor_Data.
  2. Fetch Data: Utilize the fetch_data function, providing it with the base URL (BASE_URL), the dataset name (Bin_Sensor_Data), and an API key (API_KEY) for authorization. This action retrieves the dataset from the specified source.
  3. Preview Data: Apply the .head() method to the resulting DataFrame data to display the first few entries. This step is crucial for a preliminary examination of the data structure, including column names and initial values.

This sequence of operations is pivotal for starting our data analysis, allowing us to understand the dataset's structure and prepare for in-depth exploration.

In [2]:
Bin_Sensor_Data = 'smart-bins-argyle-square'

data = fetch_data(BASE_URL, Bin_Sensor_Data, API_KEY)

data.head(3)
Out[2]:
time serial sensor_name status_current_fill_level bin_status last_update product_type description address latlong wastebasket_size
0 2021-06-29T21:18:25+00:00 FB1000001905AA05 None 14 Low None CleanFLEX Argyle - FB1000001905AA05 Argyle Square, 13-19 Argyle Pl N, Carlton VIC ... {'lon': 144.966267, 'lat': -37.802983} None
1 2021-06-30T12:17:40+00:00 FB1000001905AA15 Internal Bin Sensor 2 49 Half Full None CleanFLEX Argyle - FB1000001905AA15 138-146 Cardigan St, Carlton VIC 3053, Australia {'lon': 144.9655, 'lat': -37.802017} None
2 2021-07-01T00:00:21+00:00 FB1000001905AA17 External Bin Sensor 1 100 Full None CleanFLEX Argyle - FB1000001905AA17 164 Lygon St, Carlton VIC 3053, Australia {'lon': 144.9661, 'lat': -37.803217} None

2.0 Preprocessing Data¶

2.1 Identifying Missing Values in the Dataset (part 1)¶

To ensure the integrity and quality of our analysis, it's crucial to check for missing values within our dataset. The code performs the following steps:

  1. Calculate Missing Values: Using data.isnull().sum(), we calculate the total number of missing values in each column of our dataset. The method isnull() identifies missing values across the DataFrame, and sum() aggregates these by column, providing a count of missing values per column.
  2. Display Missing Values: We then print a statement to notify the user that we're checking for missing values, followed by printing the results stored in missing_values. This output gives us a clear picture of which columns contain missing data and how many missing entries each has.

Understanding where our data is incomplete is crucial for deciding on subsequent data cleaning and imputation strategies, ensuring the reliability of our analysis or model predictions.

In [3]:
missing_values = data.isnull().sum()
print("Checking for the Missing Values")
print(missing_values)
Checking for the Missing Values
time                            0
serial                          0
sensor_name                  2889
status_current_fill_level       0
bin_status                      0
last_update                  9999
product_type                    0
description                     0
address                         0
latlong                      1808
wastebasket_size             9999
dtype: int64

2.2 Optimizing Data Types for Analysis¶

To ensure our dataset is primed for analysis, we undertake a comprehensive data type optimization process. This involves adjusting the data types of various columns to more accurately reflect their content and optimize memory usage. Here's an overview of the steps taken:

  1. Timestamp Conversion: Convert the time and last_update columns to datetime objects to facilitate time series analysis.
  2. String Specification: Change serial, sensor_name, description, and address columns to the string data type for consistency in text data handling.
  3. Category Optimization: For columns like bin_status and product_type, we use the category data type when we have a limited set of possible values, enhancing memory efficiency and performance during grouping operations.
  4. Numerical Adjustments: Ensure numerical data, such as wastebasket_size, is correctly typed (float) for accurate mathematical computations.
  5. Geolocation Handling: If the latlong column holds comma-separated strings, split it into separate latitude and longitude columns, convert them to float, and drop the original latlong column. (In this dataset the entries turn out to be dictionaries, so the string split is skipped here and the extraction is completed in section 2.12.)

These adjustments are crucial for the following reasons:

  • Efficiency: Optimizing data types can significantly reduce memory usage, making data manipulation more efficient.
  • Accuracy: Ensuring data types match the nature of the data prevents errors during analysis.
  • Convenience: Having the right data type (e.g., datetime) simplifies operations like date range filtering or time-based aggregation.

By refining our dataset's structure through these changes, we lay a solid foundation for robust and error-free data analysis.

In [4]:
# Let's change the datatypes to more appropriate types:
data['time'] = pd.to_datetime(data['time'])
data['serial'] = data['serial'].astype('string')
data['sensor_name'] = data['sensor_name'].astype('string')
# 'status_current_fill_level' is already an integer, no change needed.
data['bin_status'] = data['bin_status'].astype('category')  # Or 'string' if it's not a finite set of statuses.
data['last_update'] = pd.to_datetime(data['last_update'])
data['product_type'] = data['product_type'].astype('category')  # Or 'string' if it's not a finite set of types.
data['description'] = data['description'].astype('string')
data['address'] = data['address'].astype('string')
# 'latlong' will be split into two separate columns for latitude and longitude, and converted to float.
##data.drop('latlong', axis=1, inplace=True)
# Assuming 'wastebasket_size' refers to a numerical measurement and not a category.
data['wastebasket_size'] = data['wastebasket_size'].astype('float')  # Or 'category' if it's not numerical.
# Handle 'latlong' only if it is stored as a 'latitude,longitude' string.
# In this dataset the entries are dictionaries, so .str.contains(',') yields
# no matches, the guard evaluates False, and the split is deferred to the
# dictionary-based extraction in section 2.12.
if not data.empty and 'latlong' in data.columns and data['latlong'].str.contains(',').any():
    # Split 'latlong' into two new columns 'latitude' and 'longitude' and convert them to floats.
    latlong_split = data['latlong'].str.split(',', expand=True)
    data['latitude'] = pd.to_numeric(latlong_split[0], errors='coerce')
    data['longitude'] = pd.to_numeric(latlong_split[1], errors='coerce')
    data.drop('latlong', axis=1, inplace=True)
# Display the DataFrame with new datatypes
data.dtypes
Out[4]:
time                         datetime64[ns, UTC]
serial                            string[python]
sensor_name                       string[python]
status_current_fill_level                  int64
bin_status                              category
last_update                       datetime64[ns]
product_type                            category
description                       string[python]
address                           string[python]
latlong                                   object
wastebasket_size                         float64
dtype: object
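The memory argument behind step 3 above is easy to verify. The following aside uses a synthetic low-cardinality column as a stand-in for bin_status and compares the deep memory footprint of an object-dtype Series with its category equivalent:

```python
import pandas as pd

# Synthetic stand-in for a repetitive, low-cardinality column like 'bin_status'
statuses = pd.Series(["Low", "Half Full", "Full"] * 10_000)

as_object = statuses.memory_usage(deep=True)
as_category = statuses.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
# The category version stores each distinct label once plus small integer
# codes, so it should be markedly smaller for repetitive columns.
```

The saving grows with the number of rows, which is why category dtypes pay off on sensor streams like this one.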

2.3 Improving the Detail of Time-Related Data¶

This segment focuses on refining our dataset's temporal data by splitting the 'time' column into distinct 'date' and 'time_only' columns. This transformation aids in detailed time-based analysis and enables more flexible data manipulation. Here are the key steps involved:

  1. Datetime Parsing: Initially, we ensure the 'time' column is correctly parsed into datetime format with pd.to_datetime(data['time'], errors='coerce'). The errors='coerce' parameter converts any parsing errors into NaT (Not a Time), preventing data corruption.

  2. Column Creation: Post successful datetime conversion, we proceed to split the datetime column:

    • The date column is created to store only the date component of each entry.
    • The time_only column is created to store only the time component.
  3. Error Handling: In cases where datetime parsing results in all null values (indicating parsing issues), both date and time_only columns are set to None. This precaution ensures the dataset's integrity remains uncompromised.

  4. Previewing Changes: Finally, we display the first few rows of the newly created date and time_only columns to verify the split was successful and to provide a glimpse into the data structure post-transformation.

This enhancement is pivotal for conducting in-depth temporal analysis and offers the flexibility to analyze trends and patterns based on specific dates or times.

In [5]:
# Splitting the 'time' column into separate 'date' and 'time' columns, this time ensuring to handle any potential issues
data['time'] = pd.to_datetime(data['time'], errors='coerce')  # Parsing the datetime to ensure correct format

# If datetime parsing was successful, create separate 'date' and 'time' columns
if not data['time'].isnull().all():
    data['date'] = data['time'].dt.date
    data['time_only'] = data['time'].dt.time
else:
    data['date'], data['time_only'] = [None, None]

placeholder_output_structure_2 = data[['date', 'time_only']].head(3)
placeholder_output_structure_2 
Out[5]:
date time_only
0 2021-06-29 21:18:25
1 2021-06-30 12:17:40
2 2021-07-01 00:00:21
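With date split out, time-based aggregation becomes a one-liner. As a hypothetical illustration (the small frame below mirrors the notebook's columns but uses made-up fill levels), we can compute the average fill level per day:

```python
import pandas as pd

# Small synthetic sample mirroring the notebook's columns (values are made up)
sample = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-29", "2021-06-30", "2021-06-30", "2021-07-01"]),
    "sensor_name": ["Sensor-05", "Sensor-15", "Sensor-15", "Sensor-17"],
    "status_current_fill_level": [14, 49, 61, 100],
})

# Average fill level per day: the kind of trend a collection planner cares about
daily_mean = sample.groupby("date")["status_current_fill_level"].mean()
print(daily_mean)
```

The same pattern applied to the real DataFrame would surface daily demand patterns per sensor, feeding directly into the dynamic route adjustment called for in the acceptance criteria.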

2.4 Identifying Unique Sensor Serial Numbers¶

An essential step in our data analysis process involves identifying the unique serial numbers of the sensors within our dataset. This enables us to differentiate and potentially name each sensor for more detailed and personalized analysis. Here's the procedure:

  1. Retrieve Unique Serials: We use data['serial'].unique() to extract an array of unique serial numbers from the serial column of our dataset. The .unique() method is instrumental in identifying all distinct sensors present, ensuring no duplicates are considered.

  2. Displaying Unique Serials: The resulting array of unique serial numbers is then displayed. This array is crucial for subsequent steps in our analysis, such as naming sensors, analyzing data on a per-sensor basis, or grouping data for sensor-specific insights.

This step is fundamental in understanding the diversity and range of sensors within our dataset, laying the groundwork for targeted sensor-specific analyses and operations.

In [6]:
missing_values_count = data['sensor_name'].isnull().sum()
print(f"Number of missing values in 'sensor_name': {missing_values_count}")
Number of missing values in 'sensor_name': 2889
In [7]:
unique_serial_numbers = data['serial'].unique()
unique_serial_numbers
Out[7]:
<StringArray>
['FB1000001905AA05', 'FB1000001905AA15', 'FB1000001905AA17',
 'FB1000001905AA07', 'FB1000001905AA18', 'FB1000001905AA19',
 'FB1000001905AA11', 'FB1000001905AA13', 'FB1000001905AA02',
 'FB1000001905AA01', 'FB1000001905AA08', 'FB1000001905AA04',
 'FB1000001905AA00', 'FB1000001905AA12', 'FB1000001905AA14',
 'FB1000001905AA16', 'FB1000001905AA09', 'FB1000001905AA03',
 'FB1000001905AA06', 'FB1000001905AA10']
Length: 20, dtype: string

2.5 Naming Sensors Based on Serial Numbers¶

To enhance clarity and facilitate easier identification of sensors in our dataset, we adopt a naming convention that incorporates the last two digits of each sensor's serial number. This step not only simplifies the analysis but also adds a layer of readability when dealing with multiple sensors. Here's how we accomplish this:

  1. Update Sensor Names: We prefix every sensor name with "Sensor-" and append the last two digits of its serial number using data['serial'].str[-2:]. This approach ensures each sensor has a unique and identifiable name based on its serial number.

  2. Reflecting Changes: After updating the 'sensor_name' column, we create a placeholder DataFrame to preview the changes. This DataFrame includes both 'serial' and 'sensor_name' columns, displaying how the sensor names have been updated based on the serial numbers.

  3. Preview Output: Using .head() on our placeholder DataFrame, we show the first few rows to verify the naming convention's application and ensure the data integrity is maintained.

This naming strategy not only aids in the quick identification of sensors but also adds a systematic approach to managing sensor data, making subsequent data analysis tasks more intuitive.

In [8]:
data['sensor_name'] = "Sensor-" + data['serial'].str[-2:]
placeholder_output_structure_1 = data[['serial', 'sensor_name']]  
placeholder_output_structure_1.head(3) 
Out[8]:
serial sensor_name
0 FB1000001905AA05 Sensor-05
1 FB1000001905AA15 Sensor-15
2 FB1000001905AA17 Sensor-17
In [9]:
missing_values_count = data['sensor_name'].isnull().sum()
print(f"Number of missing values in 'sensor_name': {missing_values_count}")
Number of missing values in 'sensor_name': 0

2.6 Identifying Missing Values in the Dataset (part 2)¶

A crucial step in data preprocessing involves identifying and quantifying missing values within our dataset. This process enables us to understand the extent of missing data and strategize on how to handle it. We approach this in two main steps:

  1. Identifying Missing Values: We utilize data.isnull().sum() to calculate the total number of missing values in each column of our dataset. This method checks each cell for null values, providing a column-wise sum. The output is printed under "Current Missing Values," offering immediate visibility into which columns contain missing data and to what extent.

  2. Calculating Missing Data Proportion: To grasp the significance of the missing values, we calculate their proportion relative to the total dataset size. This is achieved by dividing the number of missing values in each column by the total number of rows in the dataset, then multiplying by 100 to convert it to a percentage. The results are printed under "Current Missing Values Proportion," giving us a clear picture of how much data is missing across each column.

This assessment not only highlights the areas of the dataset that may require attention due to missing data but also informs our decision-making process regarding data cleaning techniques, such as imputation or removal, ensuring the robustness of our analysis.

In [10]:
missing_values = data.isnull().sum()
print("Current Missing Values")
print(missing_values)

# Calculation of the missing proportion
missing_proportion = (missing_values / len(data)) * 100

print("Current Missing Values Proportion:")
print(missing_proportion)
Current Missing Values
time                            0
serial                          0
sensor_name                     0
status_current_fill_level       0
bin_status                      0
last_update                  9999
product_type                    0
description                     0
address                         0
latlong                      1808
wastebasket_size             9999
date                            0
time_only                       0
dtype: int64
Current Missing Values Proportion:
time                           0.000000
serial                         0.000000
sensor_name                    0.000000
status_current_fill_level      0.000000
bin_status                     0.000000
last_update                  100.000000
product_type                   0.000000
description                    0.000000
address                        0.000000
latlong                       18.081808
wastebasket_size             100.000000
date                           0.000000
time_only                      0.000000
dtype: float64

2.7 Cleaning the Dataset by Handling Missing Data¶

A vital part of preparing our dataset for analysis involves ensuring that it does not contain significant gaps that could distort our findings. To this end, we take two decisive actions to handle missing data effectively:

  1. Dropping Rows with Missing latlong Data: Recognizing the critical nature of geographical data for our analysis, we remove any rows where the latlong information is missing. This is achieved with data.dropna(subset=['latlong']), ensuring that our dataset only includes entries with valid geolocation data. This step is crucial for analyses that depend on spatial information.

  2. Removing Unnecessary Columns: Further refining our dataset, we decide to eliminate columns that are not essential to our specific analytical goals. In this case, last_update and wastebasket_size columns are removed using data.drop(columns=['last_update', 'wastebasket_size']). This decision might be driven by factors such as the columns' relevance to the research question, a high proportion of missing values, or the desire to streamline the dataset for efficiency.

By taking these steps, we ensure our dataset is more focused and manageable, eliminating potential sources of error or bias that could arise from missing or irrelevant data. This process sets a strong foundation for accurate and meaningful analysis.

In [11]:
#Dropping rows with missing 'latlong'
data = data.dropna(subset=['latlong'])
In [12]:
data = data.drop(columns=['last_update', 'wastebasket_size'])

2.8 Updating the New 'date' Column's Data Type¶

To ensure our dataset's date column is in the optimal format for date-related operations and analysis, we convert it to a datetime data type. This conversion is crucial for several reasons:

  1. Consistency and Accuracy: Converting the date column to a datetime format using pd.to_datetime(data['date']) ensures that all entries are consistently recognized as dates. This uniformity is vital for accurate sorting, filtering, and time series analysis.

  2. Enhanced Date Functionality: With the date column in datetime format, we unlock a range of Pandas functionalities specific to date and time operations. This includes extracting components of the date (like the month or year), performing date arithmetic, and efficiently handling time zones if necessary.

This step is a fundamental part of data preprocessing, setting the stage for any analysis or operations that rely on temporal data. It ensures our dataset is accurately prepared to support insightful and reliable temporal analyses.

In [13]:
data['date'] = pd.to_datetime(data['date'])
In [14]:
# Parsing the datetime to ensure correct format
# data['time_only'] = pd.to_datetime(data['time_only'], errors='coerce')  

2.9 Re-checking Current Data Types in the Dataset¶

After performing various data preprocessing steps, including handling missing values and optimizing data types, it's important to review the current state of our dataset's structure. This involves checking the data types of each column to ensure they align with our expectations and requirements for analysis. Here's how we do it:

  1. Print Statement: We use a simple print statement to introduce the context, stating "Current data types".

  2. Displaying Data Types: The data.dtypes command is executed to list each column's data type within our DataFrame. This output is crucial for verifying that our data preprocessing steps have been successfully applied, ensuring that each column is now properly formatted for efficient and accurate analysis.

By reviewing the data types, we confirm the dataset's readiness for further analysis, ensuring that numerical data can be correctly manipulated and that categorical and date-time information is appropriately formatted.

In [15]:
print("Current data types")
print(data.dtypes)
Current data types
time                         datetime64[ns, UTC]
serial                            string[python]
sensor_name                       string[python]
status_current_fill_level                  int64
bin_status                              category
product_type                            category
description                       string[python]
address                           string[python]
latlong                                   object
date                              datetime64[ns]
time_only                                 object
dtype: object

2.10 Analyzing 'latlong' Column for Completeness¶

Ensuring the completeness and accuracy of geolocation data in our 'latlong' column is crucial for spatial analysis. To this end, we conduct a two-fold examination to identify any missing or blank entries:

  1. Missing Values Check: We calculate the total number of missing (NaN) values in the 'latlong' column using data['latlong'].isna().sum(). This step helps us understand if there are any entries that lack geolocation data entirely, which could impact geospatial analyses and visualizations.

  2. Blank Cells Check: To identify entries that might appear non-empty but contain only blank spaces, we use data['latlong'].str.strip().eq('').sum(). This process strips any leading and trailing spaces from each entry before comparing it to an empty string, allowing us to count how many cells are effectively blank despite not being NaN.

The results of these checks are printed, providing insight into the data quality of the 'latlong' column. Identifying and quantifying both missing and blank values are essential steps in data cleaning, ensuring that subsequent spatial analyses are based on complete and accurate data.

In [16]:
# Check for missing values in 'latlong' column
missing_values_count = data['latlong'].isna().sum()
print(f"Missing values in 'latlong': {missing_values_count}")
 
# Check for blank cells in 'latlong' column
blank_cells_count = data['latlong'].str.strip().eq('').sum()
print(f"Blank cells in 'latlong': {blank_cells_count}")
Missing values in 'latlong': 0
Blank cells in 'latlong': 0

2.11 Inspecting a Non-Null 'latlong' Entry¶

To better understand the format and structure of our geolocation data within the 'latlong' column, we inspect a sample non-null entry. This examination is pivotal for planning any required data transformation or extraction steps, especially when preparing for spatial analysis. Here's our approach:

  1. Filtering Non-Null Entries: We start by removing all null entries from the 'latlong' column using data['latlong'].dropna(), ensuring we're only working with valid geolocation data.

  2. Sample Inspection: If the filtered series is not empty, indicating the presence of non-null geolocation data, we print the first non-null entry using .iloc[0]. This provides us with a clear example of the data format we're dealing with.

  3. Handling Absence of Data: In cases where no non-null entries exist within the 'latlong' column, we print a message stating "No non-null entries found in 'latlong'." This helps in identifying datasets that may lack the necessary geolocation data for spatial analysis.

Inspecting a sample entry allows us to understand the formatting of our geolocation data, informing any necessary preprocessing steps to extract latitude and longitude for analysis purposes.

In [17]:
# Print the structure of a non-null 'latlong' entry
non_null_entries = data['latlong'].dropna()
if not non_null_entries.empty:
    print(non_null_entries.iloc[0])
else:
    print("No non-null entries found in 'latlong'.")
{'lon': 144.966267, 'lat': -37.802983}

2.12 Extracting Latitude and Longitude¶

In our dataset, geolocation data within the 'latlong' column is stored in a dictionary format, necessitating the extraction of latitude and longitude into separate columns for more straightforward spatial analysis. The steps we take are as follows:

  1. Latitude Extraction: We create a new column, 'latitude', by applying a function to each entry in the 'latlong' column. This function checks if the entry is a dictionary and, if so, retrieves the value associated with the 'lat' key. If the entry isn't a dictionary, it assigns None.

  2. Longitude Extraction: Similarly, we generate a 'longitude' column by applying a function that extracts the value associated with the 'lon' key from each dictionary entry in the 'latlong' column. Entries that do not conform to the expected dictionary format are assigned None.

  3. Verification: To ensure our extraction process was successful, we display the first few rows of the new 'latitude' and 'longitude' columns using .head(). This step is crucial for verifying that the extraction process has correctly populated these new columns with the intended data.

This extraction method allows us to separate and utilize the geolocation data efficiently, laying the groundwork for detailed spatial analysis and visualization.

In [18]:
# Extract 'latitude' and 'longitude' from the dictionary in the 'latlong' column, using the correct keys
data['latitude'] = data['latlong'].apply(lambda x: x.get('lat') if isinstance(x, dict) else None)
data['longitude'] = data['latlong'].apply(lambda x: x.get('lon') if isinstance(x, dict) else None)
 
# Display the first few rows to verify the new columns
print(data[['latitude', 'longitude']].head())
    latitude   longitude
0 -37.802983  144.966267
1 -37.802017  144.965500
2 -37.803217  144.966100
3 -37.803117  144.965233
5 -37.802233  144.966500

2.13 Confirming Data Preprocessing with a New DataFrame¶

After meticulous data cleaning and preprocessing, including handling missing values, optimizing data types, and extracting critical geolocation information, we consolidate our efforts into a new DataFrame. This new DataFrame serves as a snapshot to confirm the effective application of our preprocessing steps. The process involves the following:

  1. Selection of Key Columns: We create new_DF by selecting columns that are essential for our subsequent analyses. These columns include 'date', 'time_only', 'serial', 'sensor_name', 'status_current_fill_level', 'latitude', and 'longitude'. This selection not only reflects the data cleaning and transformation efforts undertaken but also aligns with our analysis objectives by focusing on crucial data points.

  2. DataFrame Creation: The new DataFrame, new_DF, is instantiated with the chosen columns, providing a streamlined and focused dataset for analysis. This step is crucial for verifying that our preprocessing tasks have been successfully applied and that the dataset is now in a suitable format for analysis.

  3. Review and Verification: Displaying new_DF allows us to visually confirm the presence and correct formatting of the selected columns, ensuring that our dataset is ready for the next stages of our project.

This consolidation into a new DataFrame is a critical step in our data preparation process, enabling us to proceed with confidence in the quality and relevance of our data for analysis.

In [19]:
new_DF = data[['date', 'time_only','serial','sensor_name','status_current_fill_level','latitude','longitude']]
new_DF.head(3) 
Out[19]:
date time_only serial sensor_name status_current_fill_level latitude longitude
0 2021-06-29 21:18:25 FB1000001905AA05 Sensor-05 14 -37.802983 144.966267
1 2021-06-30 12:17:40 FB1000001905AA15 Sensor-15 49 -37.802017 144.965500
2 2021-07-01 00:00:21 FB1000001905AA17 Sensor-17 100 -37.803217 144.966100

2.14 Downloading Cleaned Data¶

This script exports the cleaned dataset to a CSV file, which we later load into Power BI to build the monitoring dashboard. The cell is commented out by default: to run it, remove the leading hash ('#') symbols and adjust the file path to a location on your machine. Exporting at this stage gives Power BI a tidy, preprocessed dataset to work from, streamlining the dashboard setup.

In [20]:
# Specify the path and file name for the CSV file you want to create
#file_path = "C:/Users/dpmdj/Downloads/Pre_processed_Smart_bin_data.csv"  # Adjust the path as needed
 
# Export the DataFrame to a CSV file
#new_DF.to_csv(file_path, index=False)
 
# If you need to provide a link to download the file (in a Jupyter environment, for example)
#print(f"Data exported to {file_path}")
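As a runnable alternative to the commented cell above, here is a hedged sketch of the same export that writes to a temporary directory so it works on any machine. The sample rows below are illustrative stand-ins for new_DF, not the real dataset; in the notebook you would call new_DF.to_csv(...) directly with your own path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows standing in for new_DF
sample = pd.DataFrame({
    "serial": ["FB1000001905AA05", "FB1000001905AA15"],
    "status_current_fill_level": [14, 49],
    "latitude": [-37.802983, -37.802017],
    "longitude": [144.966267, 144.965500],
})

# Write to the system temp directory; swap in your own path
# (e.g. a Downloads folder) when exporting for Power BI
out_path = os.path.join(tempfile.gettempdir(), "Pre_processed_Smart_bin_data.csv")
sample.to_csv(out_path, index=False)

# Round-trip check: the file exists and reloads with the same shape
reloaded = pd.read_csv(out_path)
print(reloaded.shape)  # (2, 4)
```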

3.0 Visualization Part 1 (Plotting the map and finding the best route)¶

3.1 Visualizing Geolocation Data with Folium¶

In our project, we leverage the Folium library to create interactive maps that can visualize the geolocation data we've prepared. The first step involves installing Folium, a powerful Python library used for visualizing geospatial data. Here’s how we proceed:

  1. Installation: We begin by installing Folium using pip install folium, ensuring that we have access to its comprehensive mapping capabilities.

  2. Importing Folium: With Folium installed, we import it into our notebook to start creating maps.

  3. Setting Coordinates: We define the coordinates for Argyle Square, which will serve as the focal point of our map. The latitude (-37.8036) and longitude (144.9655) values are set to center the map precisely on Argyle Square.

  4. Creating the Map: We instantiate a Folium map object using folium.Map(), passing in the coordinates of Argyle Square along with a zoom_start value of 17 to provide a detailed view of the area. The tiles parameter is set to 'CartoDB positron' for a clean and modern map aesthetic.

This step is crucial for visualizing the spatial distribution of our data points, offering insights into patterns and trends that may not be apparent through numerical analysis alone. The interactive nature of Folium maps further enhances our ability to explore and interpret the data effectively.

In [21]:
pip install folium
Requirement already satisfied: folium in c:\users\dpmdj\anaconda3\lib\site-packages (0.16.0)
Requirement already satisfied: branca>=0.6.0 in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (0.7.1)
Requirement already satisfied: jinja2>=2.9 in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (3.1.3)
Requirement already satisfied: numpy in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (1.26.4)
Requirement already satisfied: requests in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (2.31.0)
Requirement already satisfied: xyzservices in c:\users\dpmdj\anaconda3\lib\site-packages (from folium) (2022.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\dpmdj\anaconda3\lib\site-packages (from jinja2>=2.9->folium) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\dpmdj\anaconda3\lib\site-packages (from requests->folium) (2024.2.2)
Note: you may need to restart the kernel to use updated packages.
In [22]:
# Create the Map Centered on Argyle Square
import folium
 
# Coordinates for Argyle Square
argyle_square_lat = -37.8036
argyle_square_lon = 144.9655
 
# Create a map centered around Argyle Square with a suitable zoom level
map = folium.Map(location=[argyle_square_lat, argyle_square_lon], zoom_start=17, tiles='CartoDB positron')

3.2 Adding Markers to the Map for Each Location¶

With our Folium map centered on Argyle Square, the next step enhances our spatial analysis by marking the precise locations of interest within our dataset. Here’s how we systematically add markers to our map:

  1. Iterating Through Data: We loop through each row in our new_DF DataFrame, which contains the cleaned and preprocessed data, including the essential 'latitude' and 'longitude' columns for our geolocation points.

  2. Marker Creation: For each row, we create a Folium marker using folium.Marker(). This marker is set to the geographical coordinates found in the 'latitude' and 'longitude' columns of the row.

  3. Popup Information: To each marker, we add a popup that displays the latitude and longitude of the location. This popup is activated when the marker is clicked, providing quick access to precise location data.

  4. Adding Markers to the Map: Each marker is then added to our previously created Folium map object with .add_to(map). This action places the marker on the map at the designated coordinates.

By marking each location on the map, we visually represent our dataset's geospatial distribution. This method not only aids in identifying patterns or clusters in the data but also facilitates a more interactive and engaging analysis, allowing users to explore individual data points through their geographic context.

In [23]:
# Add Markers for All Latitude and Longitude Values
for index, row in new_DF.iterrows():
    folium.Marker(
        [row['latitude'], row['longitude']],
        popup=f"Lat: {row['latitude']}, Lon: {row['longitude']}"
    ).add_to(map)

3.3 Visualizing Sensor Locations on a Map¶

To further enhance our spatial analysis, we create a simplified map centered around Argyle Square, using it as a base to display the locations of various sensors. This visual representation is key to understanding the geographical distribution of sensors and assessing any spatial patterns. Here's the step-by-step process:

  1. Map Creation: We start by generating a new Folium map object, Bin_location_map, centered on Argyle Square. The map is initialized with a zoom level of 17, providing a close-up view of the area for detailed inspection.

  2. Adding Markers: We iterate over our dataset, adding a marker for each sensor's location. Each marker is configured with the sensor's latitude and longitude, ensuring accurate placement on the map.

  3. Popup Feature: For added interactivity, each marker is equipped with a popup that displays the sensor's latitude and longitude. This feature allows users to click on any marker to reveal the exact coordinates of the sensor, enhancing the map's informational value.

  4. Displaying the Map: Finally, the Bin_location_map object is displayed, showing all sensor locations as markers. In a Jupyter notebook, the map renders directly within the notebook interface, providing an immediate and interactive visual analysis tool.

By incorporating these steps, we create a comprehensive spatial visualization that allows for an intuitive exploration of sensor data, aiding in the identification of patterns, clusters, or gaps in sensor coverage.

In [24]:
# Create a map centered around the average location
Bin_location_map = folium.Map(location=[argyle_square_lat, argyle_square_lon], zoom_start=17)
 
# Loop through the DataFrame and add each bench location as a marker on the map
for index, row in new_DF.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=f"Lat: {row['latitude']}, Lon: {row['longitude']}"
    ).add_to(Bin_location_map)
 
# To display the map in a Jupyter notebook
Bin_location_map
Out[24]:
[Interactive Folium map showing a marker for each bin location]
In [25]:
missing_values_count = data['sensor_name'].isnull().sum()
print(f"Number of missing values in 'sensor_name': {missing_values_count}")
Number of missing values in 'sensor_name': 0

4.0 Visualization Part 2 (Finding the best route)¶

4.1 Optimizing Operational Efficiency through Data Normalization¶

In our pursuit of operational excellence, particularly in optimizing collection routes based on bin fill levels, it's imperative to standardize the data we base our decisions on. The process of normalizing the fill levels of bins to a uniform scale is a critical step in this journey. Here’s how and why we do it:

4.2 Step-by-Step Approach:¶

  • Normalization Process: We convert the status_current_fill_level values in our new_DF DataFrame to a standardized scale ranging from 0 to 1. This is done by dividing each value by the maximum possible fill level, which is 100. This adjustment is encapsulated in the following line of code:

    new_DF['status_current_fill_level'] = new_DF['status_current_fill_level'] / 100
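
Because new_DF was created by selecting columns from data, an in-place assignment like the one above can trigger pandas' SettingWithCopyWarning. A minimal, self-contained sketch (with made-up sample values) shows the .copy() pattern that avoids it. Note the notebook itself leaves the normalization commented out, so that later map popups can display raw percentages.

```python
import pandas as pd

# Hypothetical sample standing in for the full `data` DataFrame
data = pd.DataFrame({
    "sensor_name": ["Sensor-05", "Sensor-15", "Sensor-17"],
    "status_current_fill_level": [14, 49, 100],
})

# .copy() gives new_DF its own memory, so the assignment below cannot
# raise SettingWithCopyWarning about writing to a view of `data`
new_DF = data[["sensor_name", "status_current_fill_level"]].copy()

# Scale fill levels from the 0-100 range down to 0-1
new_DF["status_current_fill_level"] = new_DF["status_current_fill_level"] / 100

print(new_DF["status_current_fill_level"].tolist())  # [0.14, 0.49, 1.0]
```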
    
In [26]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

# Normalize fill_level if it's not already (max fill level is 100)
#new_DF['status_current_fill_level'] = new_DF['status_current_fill_level'] / 100

4.3 Calculating the Pairwise Distance Matrix¶

A critical step in optimizing routes and analyzing spatial relationships within our dataset is calculating the pairwise distance matrix. This matrix helps us understand the distances between all pairs of points (in this case, bins) based on their geographic coordinates. Here’s a detailed breakdown of the process:

4.4 Steps and Code:¶

  • Coordinate Extraction: We start by extracting the latitude and longitude values from our new_DF DataFrame into a new variable, coords. This subset contains just the essential geographic coordinates for each location.

    coords = new_DF[['latitude', 'longitude']]
    
  • Distance Matrix Calculation: Utilizing the cdist function from SciPy's spatial distance library, we compute the Euclidean distance between each pair of points in coords. The result, dist_matrix, is a matrix where each element represents the distance between a pair of locations.

In [27]:
# Calculate pairwise distance matrix
coords = new_DF[['latitude', 'longitude']]
dist_matrix = cdist(coords, coords, metric='euclidean')
print(dist_matrix)
[[0.         0.00123347 0.00028748 ... 0.00078536 0.00058005 0.00062876]
 [0.00123347 0.         0.00134164 ... 0.00102306 0.00067859 0.00184459]
 [0.00028748 0.00134164 0.         ... 0.00106219 0.00066339 0.00053151]
 ...
 [0.00078536 0.00102306 0.00106219 ... 0.         0.00078355 0.001334  ]
 [0.00058005 0.00067859 0.00066339 ... 0.00078355 0.         0.00117004]
 [0.00062876 0.00184459 0.00053151 ... 0.001334   0.00117004 0.        ]]
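
One caveat worth noting: cdist with metric='euclidean' treats degrees of latitude and longitude as planar coordinates. That is adequate for ranking nearby bins within a small precinct like Argyle Square, but the resulting values are not metres. As a sketch of a more physically meaningful alternative, the vectorized haversine matrix below computes great-circle distances; the haversine_matrix helper and the two sample coordinates are our own additions, not part of the notebook.

```python
import numpy as np

def haversine_matrix(coords):
    """Pairwise great-circle distances in metres.

    coords: array of shape (n, 2) holding [latitude, longitude] in degrees.
    """
    lat = np.radians(coords[:, 0])[:, None]
    lon = np.radians(coords[:, 1])[:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    return 2 * 6371000 * np.arcsin(np.sqrt(a))  # Earth radius ~6,371 km

# Two illustrative bin coordinates near Argyle Square
pts = np.array([[-37.802017, 144.965500], [-37.803217, 144.966100]])
dist_m = haversine_matrix(pts)
print(dist_m.round(1))  # roughly 143 m between the two bins
```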

4.5 Route Optimization Based on Fill Levels and Proximity¶

To enhance operational efficiency in servicing bins, we employ a strategy that prioritizes bins with higher fill levels and minimizes travel distance. This two-pronged approach ensures that the most critical bins are serviced first while also optimizing the route taken to reduce time and resources spent. Here’s how we implement this strategy:

4.6 Prioritizing by Fill Level:¶

  • Sorting Locations: We begin by sorting the locations based on their fill levels in descending order. This is achieved by using np.argsort on the negative values of the status_current_fill_level column in new_DF. The result, priority_indices, represents the indices of locations sorted by their priority, with higher fill levels first.
In [28]:
# Sort locations by descending fill level (prioritize higher fill levels)
priority_indices = np.argsort(-new_DF['status_current_fill_level'])

4.7 Route Optimization:¶

  • Initializing the Route: The route list is initialized with the starting point, chosen as the first location in our prioritized list. This represents the location with the highest fill level.
In [29]:
# Initialize route list with the starting point (assuming first location as start)
route = [priority_indices[0]]

4.8 Determine Next Stop:¶

To determine the next stop, we follow a simple heuristic: from the current location, move to the nearest unvisited location. For each subsequent location:

  • Calculate the distances from the last visited location to all others using the precomputed distance matrix, dist_matrix.

  • Temporarily set the distance to already visited locations to infinity to exclude them from consideration.

  • Identify the next stop as the location with the minimum distance to the last visited location. This location is then added to the route.

This methodical approach to route optimization, prioritizing bins by their necessity for service (based on fill levels) and sequentially choosing the next nearest location, significantly enhances efficiency. It not only ensures the most critical bins are attended to promptly but also minimizes the operational costs associated with travel distances.

In [30]:
for _ in range(1, len(priority_indices)):
    last_visited = route[-1]
    # Copy the row so that marking visited stops as infinity below does
    # not permanently overwrite entries in dist_matrix itself
    distances_to_last_visited = dist_matrix[last_visited].copy()
    # Filter out already visited locations
    distances_to_last_visited[route] = np.inf
    next_stop = np.argmin(distances_to_last_visited)
    route.append(next_stop)
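
Once the route is assembled, its total length is simply the sum of the legs between consecutive stops, read off the distance matrix. The sketch below uses a hypothetical four-stop matrix with invented values for illustration; in the notebook you would apply the same sum to the real route and dist_matrix.

```python
import numpy as np

# Hypothetical 4-stop pairwise distance matrix (symmetric, zero diagonal)
dist_matrix = np.array([
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 1.5],
    [3.0, 5.0, 1.5, 0.0],
])
route = [0, 1, 2, 3]  # a visiting order like the one built above

# Total length = sum of leg distances between consecutive stops
total = sum(dist_matrix[route[i], route[i + 1]] for i in range(len(route) - 1))
print(total)  # 0->1 (1.0) + 1->2 (2.0) + 2->3 (1.5) = 4.5
```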

4.9 Visualization of the Optimized Route¶

To further refine our visualization and provide a comprehensive view of the optimized route, we employ Folium to map out each stop with markers and visually connect these stops with lines. This approach not only highlights the individual locations but also delineates the sequence and path of the route, offering a holistic view of the journey. Here's the step-by-step process:

4.10 Setting Up the Map:¶

  • Centering the Map: We calculate the central point of our route by averaging the latitude and longitude of all stops. This ensures the map is optimally centered, making all route points easily viewable.
In [31]:
# Visualization
map_center = [new_DF['latitude'].mean(), new_DF['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=20)

4.11 Marking Stops:¶

  • Placing Markers: Each location along the route is first marked with a blue marker whose popup shows the stop's fill level percentage. A follow-up cell then adds a purple marker for the route's starting point, a blue marker for its end point, and overlays color-coded markers (green for fill levels up to 30%, orange up to 70%, red above) so each stop's priority is visible at a glance.
In [32]:
# Add markers for each location in the route
for idx in route:
    folium.Marker([new_DF.iloc[idx]['latitude'], new_DF.iloc[idx]['longitude']],
                  popup=f"Fill Level: {new_DF.iloc[idx]['status_current_fill_level']}%",
                  icon=folium.Icon(color="blue", icon="info-sign")).add_to(m)
In [33]:
# Add a marker for the starting point
start_idx = route[0]
folium.Marker(
    [new_DF.iloc[start_idx]['latitude'], new_DF.iloc[start_idx]['longitude']],
    popup="Start",
    icon=folium.Icon(color="purple", icon="play")
).add_to(m)

# Add a marker for the end point
end_idx = route[-1]
folium.Marker(
    [new_DF.iloc[end_idx]['latitude'], new_DF.iloc[end_idx]['longitude']],
    popup="End",
    icon=folium.Icon(color="blue", icon="stop")
).add_to(m)

def get_marker_color(fill_level):
    """Determine the marker color based on fill level."""
    if fill_level <= 30:
        return "green"
    elif fill_level <= 70:
        return "orange"
    else:
        return "red"

for idx in route:
    fill_level = new_DF.iloc[idx]['status_current_fill_level']
    folium.Marker(
        [new_DF.iloc[idx]['latitude'], new_DF.iloc[idx]['longitude']],
        popup=f"Fill Level: {fill_level}%",
        icon=folium.Icon(color=get_marker_color(fill_level), icon="info-sign")
    ).add_to(m)

4.12 Illustrating the Route:¶

  • Connecting with Lines: To visualize the travel path, we draw red lines between consecutive stops along the route. This is achieved by iterating through the route and drawing a polyline between each pair of locations.
In [34]:
# Display route as lines
for i in range(len(route) - 1):
    start_pos = [new_DF.iloc[route[i]]['latitude'],new_DF.iloc[route[i]]['longitude']]
    end_pos = [new_DF.iloc[route[i+1]]['latitude'], new_DF.iloc[route[i+1]]['longitude']]
    folium.PolyLine([start_pos, end_pos], color='red').add_to(m)

# Save the map to an HTML file
m.save('optimized_route_map.html')
# To display the map in a Jupyter notebook
m
Out[34]:
[Interactive Folium map showing the optimized route with color-coded markers and red connecting lines]

5.0 Conclusion¶

In conclusion, the integration of smart bin technology along with the development of the Smart Bins Monitoring Dashboard has revolutionized waste management operations in Melbourne. This comprehensive dashboard, which vividly displays the time series analysis of fill percentages and the bin location map, provides a clear, real-time visualization of waste levels across the city. Such insights are invaluable for optimizing collection routes and schedules.

The dashboard highlights key metrics such as the average fill percentage and displays trends over time, enabling decision-makers to monitor efficiency and make data-driven adjustments to waste management strategies. The interactive map with color-coded bin statuses allows for immediate identification of critical areas needing attention, thereby streamlining operations and reducing unnecessary travel.

This project illustrates a successful application of technology in urban management, significantly enhancing operational efficiency while supporting Melbourne's environmental sustainability objectives. The Smart Bins Monitoring Dashboard serves as an exemplary model for other cities looking to improve their waste management systems through innovative technology and data analytics, paving the way for a cleaner, more efficient, and sustainable future.

6.0 References¶

  • Melbourne Urban Waste Management Data: City of Melbourne’s open data portal.
  • Pandas: Utilized for data manipulation and analysis, providing powerful data structures to handle complex data operations.
  • NumPy: Essential for handling numerical data, offering extensive functionalities for mathematical computations.
  • Folium: Facilitates the creation of interactive maps to visually represent geographic data, enhancing the user's understanding of spatial relationships.
  • os: A Python standard library for interacting with the operating system, useful for file and directory operations, and managing environment variables.